Abstract: BotometerLite is advertised as a lightweight bot detector that improves scalability by focusing on only user profile information; furthermore, BotometerLite claims that using fewer features only entails a small compromise in individual accuracy. We test the validity of this claim by comparing Botometer with BotometerLite bot likelihood scores for 75,000 users across 5 data sets. We randomly sampled 15,000 users from the following data sets: Coronavirus, 2016 election, News outlets, Charlottesville, and the Twitter API. BotometerLite scores varied drastically from Botometer scores.

Introduction

Botometer is one of the most popular bot detection tools used in social science (Rauchfleisch and Kaiser 2020). However, because of Botometer API rate limits, Beskow et al. (2018) recommend a tiered framework for bot detection and suggest that models focusing only on user profile information can be used at scale for general estimates of bot penetration.

Yuan, Schuchard, and Crooks (2019) used DeBot for large-scale bot annotations when examining tweets related to the 2015 California Disneyland measles outbreak, whereas Broniatowski, Hilyard, and Dredze (2016) used Botometer for small-scale bot annotations.

Dunn et al. (2020) annotated bots based on Botometer scores of 0.5 or greater when assessing the limited role of bots in spreading vaccine-critical information. Botometer’s FAQ page, however, explicitly states: “It’s tempting to set some arbitrary threshold score and consider everything above that number a bot and everything below a human, but we do not recommend this approach. Binary classification of accounts using two classes is problematic because few accounts are completely automated.” Instead, Botometer recommends setting a threshold on the CAP score. Dunn et al. (2020) acknowledge the imprecision inherent in bot detection and state that the conclusions of their study are robust to differences related to imprecision of the bot proportion estimate.

Botometer was initially launched in May 2014, and BotometerLite was released in September 2020. BotometerLite improves scalability by focusing on only user profile information; furthermore, BotometerLite claims that using fewer features only entails a small compromise in individual accuracy (Yang et al. 2020). The training and performance evaluation of BotometerLite are described in “Scalable and Generalizable Social Bot Detection through Data Selection” (Yang et al. 2020).

Rauchfleisch and Kaiser (2020) found that Botometer scores are imprecise at estimating bots, especially in a different language, and are prone to variance over time, which leads to a high number of human users being classified as bots and vice versa.

Contribution

Many researchers annotate bots based on Botometer score thresholds, in line with the precedent established in previous literature (add citations). Understanding how BotometerLite performs in comparison to Botometer is critical to keep researchers from assuming that BotometerLite can be used as a scalable substitute for Botometer.

Research Questions

In this study, we seek to answer the following questions:

  • How similar are Botometer and BotometerLite ratings?
  • Is BotometerLite effective at identifying specific types of bots? In other words, are BotometerLite scores strongly correlated with any of the Botometer category scores (e.g., spammers, fake followers, etc.)?
  • Can BotometerLite be used as a triage tool to identify a subset of accounts that require more extensive evaluation via Botometer?
  • Do some topics have more assessed bots than others? How do the topic-specific bot category scores compare to a random sample of Twitter users?

Bot Type Scores

The Botometer FAQ section assigns bot scores based on the following categories:

  • Astroturf: manually labeled political bots and accounts involved in follow trains that systematically delete content
  • Fake follower: bots purchased to increase follower counts
  • Financial: bots that post using cashtags
  • Self declared: bots from botwiki.org
  • Spammer: accounts labeled as spambots from several datasets
  • Other: miscellaneous other bots obtained from manual annotation, user feedback, etc.

Complete Automation Probability (CAP) is defined by Botometer as the probability, according to its models, that an account with this score or greater is a bot.

The Botometer website uses the CAP to express the percentage of accounts with bot score above a given account that are labeled as humans. Think of this as the chances that you would wrongly classify a human as a bot if you used this account’s score as a threshold. You would want this probability to be pretty small, say less than 5%. (For the statisticians, this is a p-value.)
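One reading of the description above can be sketched on a labeled sample. This is only an illustration, not Botometer's actual computation: the function name, the scores, and the human labels are made up, and accounts at or above the threshold are counted (rather than strictly above).

```python
def human_share_at_or_above(scores, is_human, threshold):
    """Fraction of accounts scoring >= threshold that are labeled human.

    scores and is_human are parallel lists (hypothetical labeled sample).
    Returns 0.0 when no account reaches the threshold.
    """
    flagged = [human for score, human in zip(scores, is_human)
               if score >= threshold]
    if not flagged:
        return 0.0
    # True counts as 1, so the sum is the number of flagged humans.
    return sum(flagged) / len(flagged)


# Made-up bot scores and manual labels, for illustration only.
scores = [0.2, 0.6, 0.7, 0.9, 0.95]
is_human = [True, True, False, False, False]

share = human_share_at_or_above(scores, is_human, threshold=0.6)
```

A small value of `share` means that few humans would be swept up if the given score were used as a cutoff.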

Methodology

  1. Randomly sample 20,000 tweet IDs from GWU’s Tweet Sets Library in the following collections:
  2. Rehydrate the tweet IDs via Twarc to obtain user IDs.
  3. Randomly sample 12,000 unique user IDs from the 20,000 tweets (some users are duplicated in the initial 15K sample).
  4. Enrich 10,000 user IDs with Botometer scores (not all users will return a Botometer score)
  5. Enrich with BotometerLite scores
  6. Filter tweet IDs to only those with an English status language.
  7. Compare distribution of CAP scores across data sets.
  8. Compare proportion of bot category scores > k across data sets (e.g., how many accounts had astroturf scores greater than k in each data set? Did one data set have significantly more than others?)
  9. Calculate correlation between Botometer and BotometerLite scores.
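The deduplication, language filtering, and sampling steps above could be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes each rehydrated tweet is a dictionary with a `lang` field and a nested `user.id_str` field, and the function name and sample tweets are hypothetical. It also applies the language filter before sampling, which is one possible ordering.

```python
import random


def sample_unique_users(tweets, n_users, seed=0, lang="en"):
    """Return up to n_users distinct user IDs drawn at random from tweets.

    Each tweet is assumed to look like
    {"lang": "en", "user": {"id_str": "123"}} (an assumption about the
    rehydrated payload). Pass lang=None to skip the language filter.
    """
    # A set keeps one entry per user; sorting makes the draw reproducible.
    user_ids = sorted({
        tweet["user"]["id_str"]
        for tweet in tweets
        if lang is None or tweet.get("lang") == lang
    })
    rng = random.Random(seed)
    return rng.sample(user_ids, min(n_users, len(user_ids)))


# Made-up tweets for illustration.
tweets = [
    {"lang": "en", "user": {"id_str": "1"}},
    {"lang": "en", "user": {"id_str": "1"}},  # same user tweeting twice
    {"lang": "es", "user": {"id_str": "2"}},  # non-English status
    {"lang": "en", "user": {"id_str": "3"}},
]
sampled = sample_unique_users(tweets, n_users=2)
```

Fixing the seed keeps the sample reproducible, which matters when the Botometer enrichment has to be rerun for accounts that did not return a score.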

Results

The following preliminary results explore the similarity between Botometer and BotometerLite scores for users from the Coronavirus and 5G data sets, as well as a random sample of tweets collected on 1 November 2020.

Bot Proportions by Type

The table below shows the number of accounts with raw English scores greater than \(k = 0.75\). In this example, I did not exclude accounts with non-English status languages.

| Category | Coronavirus | Election 2016 | News Outlets | Charlottesville | 5G | Random |
|---|---|---|---|---|---|---|
| Sample Size | 10000 | - | - | - | 8677 | 4146 |
| Astroturf | 1443 (0.14) | - | - | - | 512 (0.06) | 291 (0.07) |
| Fake Follower | 904 (0.09) | - | - | - | 746 (0.09) | 710 (0.17) |
| Spammer | 134 (0.01) | - | - | - | 216 (0.02) | 113 (0.03) |
| Financial | 86 (0.01) | - | - | - | 65 (0.01) | 33 (0.01) |
| Self Declared | 274 (0.03) | - | - | - | 451 (0.05) | 384 (0.09) |
| Other | 4763 (0.48) | - | - | - | 1705 (0.2) | 2232 (0.54) |
| Overall | 3577 (0.36) | - | - | - | 1360 (0.16) | 1591 (0.38) |
| CAP | 8090 (0.81) | - | - | - | 4617 (0.53) | 3442 (0.83) |
| BotometerLite | 550 (0.06) | - | - | - | 623 (0.07) | 4146 (1) |

The table below shows the number of English-language accounts with raw English scores greater than \(k = 0.75\). In this example, I only included accounts with an English status language. Applying English bot detection metrics to non-English users resulted in more bot detections. Therefore, users must be careful to filter on status language to avoid inflating bot counts.

| Category | Coronavirus | Election 2016 | News Outlets | Charlottesville | 5G | Random |
|---|---|---|---|---|---|---|
| Sample Size | 5453 | - | - | - | 8677 | 1714 |
| Astroturf | 1107 (0.2) | - | - | - | 512 (0.06) | 186 (0.11) |
| Fake Follower | 368 (0.07) | - | - | - | 746 (0.09) | 161 (0.09) |
| Spammer | 65 (0.01) | - | - | - | 216 (0.02) | 32 (0.02) |
| Financial | 44 (0.01) | - | - | - | 65 (0.01) | 10 (0.01) |
| Self Declared | 137 (0.03) | - | - | - | 451 (0.05) | 88 (0.05) |
| Other | 1518 (0.28) | - | - | - | 1705 (0.2) | 400 (0.23) |
| Overall | 1387 (0.25) | - | - | - | 1360 (0.16) | 344 (0.2) |
| CAP | 3778 (0.69) | - | - | - | 4617 (0.53) | 1156 (0.67) |
| BotometerLite | 311 (0.06) | - | - | - | 623 (0.07) | 1714 (1) |
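Each cell in the tables above is a count of accounts whose score exceeds \(k\), followed by that count as a proportion of the sample. A minimal sketch of that tabulation, with and without the English status-language filter, might look like the following. The record layout, function names, and scores are hypothetical and serve only to illustrate the calculation.

```python
def share_above(scores, k=0.75):
    """Return (count, proportion) of scores strictly greater than k."""
    count = sum(1 for score in scores if score > k)
    proportion = count / len(scores) if scores else 0.0
    return count, proportion


def tabulate(records, english_only, k=0.75):
    """Group scores by data set and report (count, proportion) above k.

    records holds (data_set, status_language, score) tuples, a layout
    assumed here for illustration.
    """
    by_set = {}
    for data_set, lang, score in records:
        if english_only and lang != "en":
            continue  # drop accounts with a non-English status language
        by_set.setdefault(data_set, []).append(score)
    return {name: share_above(vals, k) for name, vals in by_set.items()}


# Made-up astroturf scores for illustration.
records = [
    ("Coronavirus", "en", 0.91),
    ("Coronavirus", "es", 0.80),
    ("Coronavirus", "en", 0.40),
    ("Random", "en", 0.10),
    ("Random", "ja", 0.95),
]
all_languages = tabulate(records, english_only=False)
english = tabulate(records, english_only=True)
```

In this toy data the unfiltered proportions are higher than the English-only ones, which mirrors the inflation effect described above.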

Botometer Score Distributions

Raw Score Correlations

BotometerLite is most similar to the Botometer fake follower and spammer scores with \(R^2\) values of 0.346 and 0.276, respectively. Hence, if Botometer scores are accurate, BotometerLite may be somewhat effective at identifying some fake followers and spammers.
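For reference, an \(R^2\) value of this kind can be obtained by squaring the Pearson correlation between two score columns. The sketch below uses only the standard library; the function names and the two short score lists are made up for illustration and are not the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def r_squared(x, y):
    """Square of the Pearson correlation."""
    return pearson_r(x, y) ** 2


# Made-up scores for four accounts.
lite_scores = [0.10, 0.40, 0.35, 0.80]
fake_follower_scores = [0.20, 0.30, 0.50, 0.70]
r2 = r_squared(lite_scores, fake_follower_scores)
```

Because \(R^2\) discards the sign of the correlation, it is worth inspecting the Pearson value itself before interpreting a category as "similar".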

Expected Bot Count by CAP

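CAP is the probability that an account with this score or greater is a bot. Therefore, if we model each account's bot status \(x_{i}\) as an independent Bernoulli random variable with success probability \(p_{i}\) equal to its CAP, the total number of bots follows a Poisson binomial distribution, and the expected number of bots is given by: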

\(E[\sum x_{i}]=\sum p_{i}\)


Hence, we should expect 8884 (56.1%) of the 15844 accounts to be bots.

Expected bots by data set:

  • Coronavirus: 3329 (61.1%) of 5453 accounts.
  • 5G: 4527 (52.2%) of 8677 accounts.
  • Random: 1028 (60%) of 1714 accounts.
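The expected counts above are sums of CAP values. As a sketch of that calculation, the snippet below also reports the standard deviation of the count, which for independent Bernoulli trials is \(\sqrt{\sum p_{i}(1-p_{i})}\). The independence assumption, the function name, and the CAP values are illustrative only and are not the study data.

```python
import math


def expected_bot_count(cap_scores):
    """Return (mean, standard deviation) of the number of bots.

    Treats each account as an independent Bernoulli trial whose success
    probability is its CAP score (a modeling assumption).
    """
    mean = sum(cap_scores)
    variance = sum(p * (1 - p) for p in cap_scores)
    return mean, math.sqrt(variance)


# Made-up CAP scores for four accounts.
cap_scores = [0.9, 0.5, 0.1, 0.8]
mean, sd = expected_bot_count(cap_scores)
```

Because expectations add, the per-data-set figures sum to the overall figure: 3329 + 4527 + 1028 = 8884 expected bots among 5453 + 8677 + 1714 = 15844 accounts.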

Conclusion

Future work for course project:

  • Update introduction to include other articles that have critiqued Botometer
  • For EM6574 only, replicate results of Indiana University BotometerLite paper (Train a classifier to predict manually labeled bots and compare with BotometerLite)
  • Post code to GitHub repo

Questions:

  • Should I expand this beyond Botometer and include a comparison of scores generated from other models (e.g., DeBot, BotSlayer, etc.)?
  • Has anyone used DeBot? I requested an API key but did not receive a response.
  • Do I have “the right” number for each of my samples (10,000 users per data set)? Is a random sample appropriate?
  • I forgot to filter on English-only tweets when rehydrating the coronavirus tweets and pulling the random sample. Do I need to re-pull my data? Can I cut my sample size down so I don’t have to re-pull all the data?
  • What statistical tests should we do to provide evidence Botometer and BotometerLite produce different results? t-test for difference of means? F-test for difference of variance? Hotelling test across all bot scores?
  • Should we look at just the scores derived from English-speaking accounts or also include the universal scores?
  • What visualizations should we use? t-SNE separating accounts with CAP > 0.8 from those with CAP < 0.8?

The Pearson correlation matrix (\(R^2\) values are the squares of the entries of this matrix) also shows that the scores are weakly correlated.


References

Beskow, David, Kathleen M Carley, Halil Bisgin, Ayaz Hyder, Chris Dancy, and Robert Thomson. 2018. “Introducing Bothunter: A Tiered Approach to Detection and Characterizing Automated Activity on Twitter.” In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer.

Broniatowski, David A, Karen M Hilyard, and Mark Dredze. 2016. “Effective Vaccine Communication During the Disneyland Measles Outbreak.” Vaccine 34 (28). Elsevier: 3225–8.

Dunn, Adam G, Didi Surian, Jason Dalmazzo, Dana Rezazadegan, Maryke Steffens, Amalie Dyda, Julie Leask, Enrico Coiera, Aditi Dey, and Kenneth D Mandl. 2020. “Limited Role of Bots in Spreading Vaccine-Critical Information Among Active Twitter Users in the United States: 2017–2019.” American Journal of Public Health 110 (S3). American Public Health Association: S319–S325.

Rauchfleisch, Adrian, and Jonas Kaiser. 2020. “The False Positive Problem of Automatic Bot Detection in Social Science Research.” Berkman Klein Center Research Publication, no. 2020-3.

Yang, Kai-Cheng, Onur Varol, Pik-Mai Hui, and Filippo Menczer. 2020. “Scalable and Generalizable Social Bot Detection Through Data Selection.” In Proceedings of the AAAI Conference on Artificial Intelligence, 34 (01): 1096–1103.

Yuan, Xiaoyi, Ross J Schuchard, and Andrew T Crooks. 2019. “Examining Emergent Communities and Social Bots Within the Polarized Online Vaccination Debate in Twitter.” Social Media + Society 5 (3). SAGE Publications: 2056305119865465.